Seurat 环境部署

mambaforge

mambaforge类似与anaconda或miniconda,但是自带mamba。mamba会比conda更快一些,至少在检索或安装卡顿方面明显由于conda。

创建conda 虚拟环境

mamba create -n R r-seurat

标准预处理工作流程

设置工作目录并加载R包

setwd("~/test/scRNA-seq/filtered_gene_bc_matrices/")

library(dplyr)
## 
## 载入程辑包:'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(Seurat)
## The legacy packages maptools, rgdal, and rgeos, underpinning the sp package,
## which was just loaded, will retire in October 2023.
## Please refer to R-spatial evolution reports for details, especially
## https://r-spatial.org/r/2023/05/15/evolution4.html.
## It may be desirable to make the sf package available;
## package maintainers should consider adding sf to Suggests:.
## The sp package is now running under evolution status 2
##      (status 2 uses the sf package in place of rgdal)
## Attaching SeuratObject
library(patchwork)
library(ggplot2)

加载原始数据

# 加载10X Genomics 免费提供的外周血单核细胞 (PBMC) 数据集
# hg19文件夹内有三个文件
pbmc.data <- Read10X(data.dir = "./hg19/")

# 初始化一个Seurat对象
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')

质控

## 质控和细胞筛选

pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
head(pbmc@meta.data, 5)
##                  orig.ident nCount_RNA nFeature_RNA percent.mt
## AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
## AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
## AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
## AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
## AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898
## 小提琴图可视化QC
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

## FeatureScatter通常用于可视化特征-特征关系,但也可以用于对象计算的任何内容,例如对象元数据中的列,PC分数等

plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")

plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

plot1 + plot2

## 过滤具有独特特征计数超过 2,500 或少于 200 以及线粒体计数 >5% 的细胞
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

## 过滤后质控结果
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

plot3 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot4 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot3 + plot4

标准化

默认情况下采用全局尺度标准化(global-scaling normalization)方法LogNormalize,将每个单元格的特征表达测量值标准化为总表达,将其乘以比例因子(默认为 10,000),并对结果进行对数转换。标准化值存储在 pbmc[["RNA"]]@data 中。

pbmc_nd <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)

## 注,上述代码设置了默认参数,也可直接 pbmc_nd <- NormalizeData(pbmc)

## 标准化后质控结果
VlnPlot(pbmc_nd, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

plot5 <- FeatureScatter(pbmc_nd, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot6 <- FeatureScatter(pbmc_nd, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot5 + plot6

识别高度可变的特征(特征选择)

计算数据集中表现出高细胞间差异的特征子集(即它们在某些细胞中高度表达,而在其他细胞中表达较低),在下游分析中关注这些基因有助于突出单细胞数据集中的生物信号。

FindVariableFeatures()函数中实现。默认情况下每个数据集返回 2,000 个特征用于下游分析,例如 PCA。

pbmc_fvf <- FindVariableFeatures(pbmc_nd, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc_fvf), 10)

# plot variable features with and without labels
plot7 <- VariableFeaturePlot(pbmc_fvf)
plot8 <- LabelPoints(plot = plot7, points = top10, repel = TRUE)
## When using repel, set xnudge and ynudge to 0 for optimal results
plot7 + theme(legend.position="top",legend.direction="vertical" ) + plot8 + theme(legend.position="top",legend.direction="vertical" )
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis

数据缩放(线性变换)

应用ScaleData() 函数进行数据线性变换,改变每个基因的表达,使细胞间的平均表达为 0,缩放每个基因的表达,使细胞间的方差为 1。此步骤在下游分析中给予同等的权重,因此高表达的基因不会占主导地位,结果存储在 pbmc[["RNA"]]@scale.data

all.genes <- rownames(pbmc_fvf)
pbmc_sd <- ScaleData(pbmc_fvf, features = all.genes)
## Centering and scaling data matrix

线性降维

对缩放后的数据执行PCA。默认情况下,仅将先前确定的变量特征用作输入,但如果您希望选择不同的子集,则可以使用 features 参数进行定义。

pbmc_pca <- RunPCA(pbmc_sd)
## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL, FTH1, LYZ, FCN1, S100A9, TYMP 
##     FCER1G, CFD, LGALS1, S100A8, CTSS, LGALS2, SERPINA1, IFITM3, SPI1, CFP 
##     PSAP, IFI30, SAT1, COTL1, S100A11, NPC2, GRN, LGALS3, GSTP1, PYCARD 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2, B2M, ACAP1, CD27, STK17A, CTSW 
##     CD247, GIMAP5, AQP3, CCL5, SELL, TRAF3IP3, GZMA, MAL, CST7, ITM2A 
##     MYC, GIMAP7, HOPX, BEX2, LDLRAP1, GZMK, ETS1, ZAP70, TNFAIP8, RIC3 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1, HLA-DRA, LINC00926, CD79B, HLA-DRB1, CD74 
##     HLA-DMA, HLA-DPB1, HLA-DQA2, CD37, HLA-DRB5, HLA-DMB, HLA-DPA1, FCRLA, HVCN1, LTB 
##     BLNK, P2RX5, IGLL5, IRF8, SWAP70, ARHGAP24, FCGR2B, SMIM14, PPP1R14A, C16orf74 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA, FGFBP2, CTSW, GNLY, B2M, SPON2 
##     CCL4, GZMH, FCGR3A, CCL5, CD247, XCL2, CLIC3, AKR1C3, SRGN, HOPX 
##     TTC38, APMAP, CTSC, S100A4, IGFBP7, ANXA1, ID2, IL32, XCL1, RHOC 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1, HLA-DPA1, CD74, MS4A1, HLA-DRB1, HLA-DRA 
##     HLA-DRB5, HLA-DQA2, TCL1A, LINC00926, HLA-DMB, HLA-DMA, CD37, HVCN1, FCRLA, IRF8 
##     PLAC8, BLNK, MALAT1, SMIM14, PLD4, P2RX5, IGLL5, LAT2, SWAP70, FCGR2B 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11, NRGN, GP9, RGS18, TUBB1, CLU 
##     HIST1H2AC, AP001189.4, ITGA2B, CD9, TMEM40, PTCRA, CA2, ACRBP, MMD, TREML1 
##     NGFRAP1, F13A1, SEPT5, RUFY1, TSC22D1, MPP1, CMTM5, RP11-367G6.3, MYL9, GP1BA 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1, CD74, HIST1H2AC, HLA-DPB1, PF4, SDPR 
##     TCL1A, HLA-DRB1, HLA-DPA1, HLA-DQA2, PPBP, HLA-DRA, LINC00926, GNG11, SPARC, HLA-DRB5 
##     GP9, AP001189.4, CA2, PTCRA, CD9, NRGN, RGS18, CLU, TUBB1, GZMB 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8, S100A4, GIMAP7, S100A10, S100A9, MAL 
##     AQP3, CD2, CD14, FYB, LGALS2, GIMAP4, ANXA1, CD27, FCN1, RBP7 
##     LYZ, S100A11, GIMAP5, MS4A6A, S100A12, FOLR3, TRABD2A, AIF1, IL8, IFI6 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY, CCL4, CST7, PRF1, GZMA, SPON2 
##     GZMH, S100A9, LGALS2, CCL3, CTSW, XCL2, CD14, CLIC3, S100A12, RBP7 
##     CCL5, MS4A6A, GSTP1, FOLR3, IGFBP7, TYROBP, TTC38, AKR1C3, XCL1, HOPX 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7, AQP3, CYTIP, RP11-290F20.3, SIGLEC10, HMOX1 
##     LILRB2, PTGES3, MAL, CD27, HN1, CD2, GDI2, CORO1B, ANXA5, TUBA1B 
##     FAM110A, ATP1A1, TRADD, PPA1, CCDC109B, ABRACL, CTD-2006K23.1, WARS, VMO1, FYB

Seurat 提供了几种有用的方法来可视化定义 PCA 的单元格和特征,包括 VizDimReduction()DimPlot()DimHeatmap()。特别是 DimHeatmap() 允许轻松探索数据集中异质性的主要来源,并且在尝试决定包含哪些 PC 进行进一步的下游分析时非常有用。单元格和特征均根据其 PCA 分数进行排序。将 cells设置为数字会在频谱两端绘制“极端”单元格,这会显着加快大型数据集的绘制速度。虽然显然是一种监督分析,但我们发现这是探索相关特征集的宝贵工具。

# 用几种不同的方法检查和可视化PCA结果
print(pbmc_pca[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1 
## Positive:  CST3, TYROBP, LST1, AIF1, FTL 
## Negative:  MALAT1, LTB, IL32, IL7R, CD2 
## PC_ 2 
## Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
## Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
## PC_ 3 
## Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
## Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
## PC_ 4 
## Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
## Negative:  VIM, IL7R, S100A6, IL32, S100A8 
## PC_ 5 
## Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
## Negative:  LTB, IL7R, CKB, VIM, MS4A7
VizDimLoadings(pbmc_pca, dims = 1:2, reduction = "pca")

DimPlot(pbmc_pca, reduction = "pca")

DimHeatmap(pbmc_pca, dims = 1, cells = 500, balanced = TRUE)

DimHeatmap(pbmc_pca, dims = 1:15, cells = 500, balanced = TRUE)

确定数据集维度

为了克服 scRNA-seq 数据的任何单个特征(feature)中广泛的技术噪音,Seurat 根据 PCA 分数对细胞进行聚类,每个 PC 本质上代表一个“元特征(metafeature)”,它结合了相关特征集的信息。因此,顶部主成分代表了数据集的稳健(robust)压缩。但是,我们应该选择包含多少个components? 10? 20? 100?

随机排列数据的子集(默认为 1%)并重新运行 PCA,构建特征分数的“空分布”,然后重复此过程。我们将“重要”PC 定义为那些具有丰富的低 p 值特征的 PC。

# 注意:对于大数据集,这个过程可能需要很长时间,为了方便起见,请注释掉。更多的近似技术,比如在ElbowPlot()中实现的技术,可以用来减少计算时间

pbmc_js <- JackStraw(pbmc_pca, num.replicate = 100)
pbmc_sjs <- ScoreJackStraw(pbmc_js, dims = 1:20)

JackStrawPlot()函数提供了一个可视化工具,用于将每个 PC 的 p 值分布与均匀分布(虚线)进行比较。 “重要”PC 将显示出具有低 p 值的功能的强大丰富性(虚线上方的实线)。在这种情况下,在前 10-12 个 PC 之后,重要性似乎急剧下降。

JackStrawPlot(pbmc_sjs, dims = 1:15)
## Warning: Removed 23493 rows containing missing values (`geom_point()`).

另一种启发式方法生成“Elbow plot”:根据每个主成分解释的方差百分比( ElbowPlot() 函数)对主成分进行排名。在此示例中,我们可以观察到 PC9-10 周围有一个“肘部”,这表明大部分真实信号是在前 10 个 PC 中捕获的。

ElbowPlot(pbmc_sjs)

细胞聚类

与 PhenoGraph 一样,我们首先基于 PCA 空间中的欧氏距离构建 KNN 图,并根据局部邻域中的共享重叠(杰卡德相似度Jaccard similarity)细化任意两个单元之间的边缘权重。此步骤使用 FindNeighbors() 函数执行,并将先前定义的数据集维度(前 10 个 PC)作为输入。

为了对细胞进行聚类,我们接下来应用模块化优化技术,例如 Louvain 算法(默认)或 SLM [SLM,Blondel 等人,统计力学杂志],迭代地将细胞分组在一起,目标是优化标准模块化函数。 FindClusters() 函数实现此过程,并包含一个分辨率参数,用于设置下游聚类的“granularity”,增加的值会导致更多的聚类。我们发现,将此参数设置在 0.4-1.2 之间通常会为大约 3K 细胞的单细胞数据集带来良好的结果。对于较大的数据集,最佳分辨率通常会增加。可以使用 Idents() 函数找到簇。

pbmc_fn <- FindNeighbors(pbmc_sjs, dims = 1:10)
## Computing nearest neighbor graph
## Computing SNN
pbmc_fc <- FindClusters(pbmc_fn, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 2638
## Number of edges: 95927
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8728
## Number of communities: 9
## Elapsed time: 0 seconds
# 查看前5个细胞的cluster id

head(Idents(pbmc_fc), 5)
## AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
##                2                3                2                1 
## AAACCGTGTATGCG-1 
##                6 
## Levels: 0 1 2 3 4 5 6 7 8

非线性降维

Seurat 提供了几种非线性降维技术,例如 tSNE 和 UMAP,来可视化和探索这些数据集。这些算法的目标是学习数据的底层流形,以便将相似的单元放在低维空间中。上面确定的基于图的簇内的单元应共同定位在这些降维图上。作为 UMAP 和 tSNE 的输入,我们建议使用相同的 PC 作为聚类分析的输入。

pbmc_umap <- RunUMAP(pbmc_fc, dims = 1:10)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
## 21:15:33 UMAP embedding parameters a = 0.9922 b = 1.112
## 21:15:33 Read 2638 rows and found 10 numeric columns
## 21:15:33 Using Annoy for neighbor search, n_neighbors = 30
## 21:15:33 Building Annoy index with metric = cosine, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 21:15:34 Writing NN index file to temp file /tmp/RtmpnzqtYK/file65ce36d86468
## 21:15:34 Searching Annoy index using 1 thread, search_k = 3000
## 21:15:34 Annoy recall = 100%
## 21:15:34 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
## 21:15:34 Initializing from normalized Laplacian + noise (using irlba)
## 21:15:34 Commencing optimization for 500 epochs, with 105140 positive edges
## 21:15:37 Optimization finished
DimPlot(pbmc_umap, reduction = "umap")

# 保存为rds
saveRDS(pbmc_umap, file = "./pbmc_tutorial.rds")

寻找差异表达特征(cluster biomarkers)

Seurat 可以帮助您找到通过差异表达定义簇的标记。默认情况下,与所有其他细胞相比,它识别单个簇的阳性和阴性标记(在 ident.1 中指定)。 FindAllMarkers() 为所有集群自动执行此过程,但您也可以测试集群组之间的对比,或针对所有单元的测试。

min.pct 参数要求在两组细胞中以最小百分比检测到一个特征,并且 thresh.test 参数要求一个特征在两组细胞之间以一定量差异表达(平均)。两组。您可以将这两个值设置为 0,但时间会显着增加 - 因为这将测试大量不太可能具有高度歧视性的特征。作为加速这些计算的另一个选项,可以设置 max.cells.per.ident 。这将对每个身份类别进行下采样,使其单元格数量不超过所设置的值。虽然通常会出现功率损失,但速度的增加可能会很显着,并且差异表达程度最高的特征可能仍会上升到顶部。

# 找到C2的所有marker
cluster2.markers <- FindMarkers(pbmc_umap, ident.1 = 2, min.pct = 0.25)
head(cluster2.markers, n = 5)
##             p_val avg_log2FC pct.1 pct.2    p_val_adj
## IL32 2.892340e-90  1.2013522 0.947 0.465 3.966555e-86
## LTB  1.060121e-86  1.2695776 0.981 0.643 1.453850e-82
## CD3D 8.794641e-71  0.9389621 0.922 0.432 1.206097e-66
## IL7R 3.516098e-68  1.1873213 0.750 0.326 4.821977e-64
## LDHB 1.642480e-67  0.8969774 0.954 0.614 2.252497e-63
# 找出所有区分集群5与集群0和集群3的标记
cluster5.markers <- FindMarkers(pbmc_umap, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
head(cluster5.markers, n = 5)
##                       p_val avg_log2FC pct.1 pct.2     p_val_adj
## FCGR3A        8.246578e-205   4.261495 0.975 0.040 1.130936e-200
## IFITM3        1.677613e-195   3.879339 0.975 0.049 2.300678e-191
## CFD           2.401156e-193   3.405492 0.938 0.038 3.292945e-189
## CD68          2.900384e-191   3.020484 0.926 0.035 3.977587e-187
## RP11-290F20.3 2.513244e-186   2.720057 0.840 0.017 3.446663e-182
# 与所有剩余的细胞相比,找到每个集群的标记,只报告阳性的

pbmc.markers <- FindAllMarkers(pbmc_umap, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
pbmc.markers %>%
    group_by(cluster) %>%
    slice_max(n = 2, order_by = avg_log2FC)
## # A tibble: 18 × 7
## # Groups:   cluster [9]
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene    
##        <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>   
##  1 9.57e- 88       1.36 0.447 0.108 1.31e- 83 0       CCR7    
##  2 3.75e-112       1.09 0.912 0.592 5.14e-108 0       LDHB    
##  3 0               5.57 0.996 0.215 0         1       S100A9  
##  4 0               5.48 0.975 0.121 0         1       S100A8  
##  5 1.06e- 86       1.27 0.981 0.643 1.45e- 82 2       LTB     
##  6 2.97e- 58       1.23 0.42  0.111 4.07e- 54 2       AQP3    
##  7 0               4.31 0.936 0.041 0         3       CD79A   
##  8 9.48e-271       3.59 0.622 0.022 1.30e-266 3       TCL1A   
##  9 5.61e-202       3.10 0.983 0.234 7.70e-198 4       CCL5    
## 10 7.25e-165       3.00 0.577 0.055 9.95e-161 4       GZMK    
## 11 3.51e-184       3.31 0.975 0.134 4.82e-180 5       FCGR3A  
## 12 2.03e-125       3.09 1     0.315 2.78e-121 5       LST1    
## 13 3.13e-191       5.32 0.961 0.131 4.30e-187 6       GNLY    
## 14 7.95e-269       4.83 0.961 0.068 1.09e-264 6       GZMB    
## 15 1.48e-220       3.87 0.812 0.011 2.03e-216 7       FCER1A  
## 16 1.67e- 21       2.87 1     0.513 2.28e- 17 7       HLA-DPB1
## 17 1.92e-102       8.59 1     0.024 2.63e- 98 8       PPBP    
## 18 9.25e-186       7.29 1     0.011 1.27e-181 8       PF4

Seurat 有几种差异表达测试,可以使用 test.use 参数进行设置(有关详细信息,请参阅我们的 DE 小插图)。例如,ROC 测试返回任何单个标记的“分类能力”(范围从 0 - random, to 1 - perfect)。

提供了几种用于可视化标记表达的工具。 VlnPlot() (显示跨聚类的表达概率分布)和 FeaturePlot() (在 tSNE 或 PCA 图上可视化特征表达)是我们最常用的可视化。我们还建议探索 RidgePlot()CellScatter()DotPlot() 作为查看数据集的附加方法。

VlnPlot(pbmc_umap, features = c("MS4A1", "CD79A"))

# 绘制原始count数据
VlnPlot(pbmc_umap, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

FeaturePlot(pbmc_umap, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP",
    "CD8A"))

DoHeatmap() 为给定的单元格和特征生成表达式热图。在本例中,我们绘制每个簇的前 20 个标记(如果少于 20 个则为所有标记)。

pbmc.markers %>%
    group_by(cluster) %>%
    top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(pbmc_umap, features = top10$gene) + NoLegend()

将细胞类型标识分配给簇

new.cluster.ids <- c("Naive CD4 T", "CD14+ Mono", "Memory CD4 T", "B", "CD8 T", "FCGR3A+ Mono",
    "NK", "DC", "Platelet")
names(new.cluster.ids) <- levels(pbmc_umap)
pbmc_ri <- RenameIdents(pbmc_umap, new.cluster.ids)
DimPlot(pbmc_ri, reduction = "umap", label = TRUE, pt.size = 0.5) + NoLegend()

saveRDS(pbmc_ri, file = "./pbmc3k_final.rds")